Abstract

Purpose  -  This  paper  recommends  the  use  of  Internet  data  for  social  sciences  with  a 
special  focus  on  human  resources  issues.  It  discusses  the  potentials  and  challenges  of 
Internet  data  for  social  sciences.  We  present  a  selection  of  the  relevant  literature  to 
establish  the  wide  spectrum  of  topics,  which  can  be  reached  with  this  type  of  data,  and 
link  them  to  the  papers  in  this  International  Journal  of  Manpower  Special  Issue. 
Design/methodology/approach - Internet data increasingly represent a large
part of everyday life that cannot be measured otherwise. The information is timely,
perhaps even following the underlying process on a daily basis. It typically involves large
numbers of observations and allows for flexible conceptual forms and experimental settings.
Findings  -  Internet  data  can  successfully  be  applied  to  a  very  wide  range  of  human 
resource  issues  including  forecasting  (e.g.  of  unemployment,  consumption  goods, 
tourism,  festival  winners  and  the  like),  nowcasting  (obtaining  relevant  information 
much  earlier  than  through  traditional  data  collection  techniques),  detecting  health 
issues  and  well-being  (e.g.  flu,  malaise  and  ill-being  during  economic  crises), 
documenting  the  matching  process  in  various  parts  of  individual  life  (e.g.  jobs, 
partnership,  shopping),  and  measuring  complex  processes  where  traditional  data  have 
known  deficits  (e.g.  international  migration,  collective  bargaining  agreements  in 
developing  countries).  Major  problems  in  data  analysis  are  still  unsolved  and  more 
research  on  data  reliability  is  needed. 

Research  limitations/implications  -  The  data  in  the  reviewed  literature  are 
unexplored  and  underused  and  the  methods  available  are  confronted  with  known  and 
new  challenges.  Current  research  is  highly  original  but  also  exploratory  and  premature. 
Originality/value  -  Our  article  reviews  the  current  attempts  in  the  literature  to 
incorporate  Internet  data  into  the  mainstream  of  scholarly  empirical  research  and 
guides  the  reader  through  this  Special  Issue.  We  provide  some  insights  and  a  brief 
overview  of  the  current  state  of  research. 

Keywords:  World  Wide  Web,  web  data,  internet  data,  forecasting,  human  resources  and 
the  internet 
1. Introduction


The digital revolution, a loose synonym for the digital age or information age, marks the
use of digital information as opposed to analog or mechanical. Since the 1990s, it has been
changing our world in a completely new and different way and will continue to do so for
the foreseeable future. It places networked computing into an increasing number of objects
that are neatly integrated into our daily lives, thus creating a data-driven society and a
data-driven economy.

This offers the perspective of a complete recording of all aspects of our life. Since
demand and supply, all activities of companies and individuals, as well as the matching
process are documented on the Internet, not only the complete market economy but also
all social aspects are reflected in a huge data cloud (big data). When it is accessible to
social scientists, this provides a universe of research potential. Not only is the past no
longer forgotten, it can be restudied again and again, giving historical analysis a completely
new perspective. This suggests a new survey design, since the Internet provides
answers to questions before they are asked.

The real life of data analysts and social scientists has not yet come that far, although we
are getting there much faster than we think. We already see the rapid evolution of online
markets for all kinds of goods and services, in particular jobs. Social media, in turn,
generate huge amounts of information about individual preferences and behavior
(Askitas, 2014). The social component will accelerate and become more important as
technology becomes miniaturized and produces affordable life accessories that are
tightly integrated into daily life.

The  impact  of  the  advances  in  digital  technology  and  the  economics  of  ICT  (Information 
and  Communication  Technology)  components  is  best  understood  in  the  increase  of  the 
size  of  the  so-called  'second  economy'  (Arthur,  2011)  in  the  macro  sense  and  in  the 
proliferation  of  social  media  in  the  micro  sense.  The  size  of  the  second  economy  in  the 
US,  which  is  defined  to  be  the  invisible  maze  of  routers,  switches,  servers,  cabling  and  so 
on,  is  expected  to  surpass  that  of  the  physical  economy  by  2025  1.  The  second  economy 
introduces  a  neural  system  underlying  the  physical  world  and  is  the  core  of  the  digital 
age. 

The Internet is the most natural place where the second economy surfaces into reality
and it is where social media take place, nowadays supported by products like Facebook,
Twitter, Google+, LinkedIn or YouTube. The data that will result from the second
economy, the Internet, social media and technology miniaturization will be
indispensable for social sciences and will more than complement more traditional data
sources such as official statistics.

In the paper, we first discuss the types of data that are available on the Internet and
that researchers have started using in their analyses. We outline the particular chances and
challenges these data bring for dealing with human resources questions. We then introduce
key literature in the subfields related to manpower issues in the social sciences. Finally,
we present an overview of the contributions of this special issue.


1 A vivid impression of how extensive the second economy worldwide is may be
obtained by taking a look at the maze of submarine cables wiring the planet at
http://www.submarinecablemap.com/.


2.  Internet  data:  Chances  and  challenges 

In the 1980s, when the Internet was in its infancy, social scientists first saw it as a medium
over which one could "build and field"2 surveys with ease, at an unprecedented scale,
price and speed. In the 1990s, the Internet started entering the homes and everyday
lives of individuals, via email communication, 'surfing' and 'askjeeves' for specific
questions, to name a few options. In the 2000s, as web technologies became more
involved and techniques increasingly effective, and as individuals used the Internet
more intensively, large amounts of data started piling up. These data, at least in the
beginning, existed without people even knowing it. The beauty of these data is that, unlike
traditional survey data that are collected upon the consent of the individual and may
suffer from several biases, they reveal the visceral and logical choices people make in
the privacy of their home, while they think they are under no observation. With the
entry of Google in the industry all kinds of personal information "broke loose"; these
data could now be, for example, information about the political preferences of people during
an election year under recession, or about the xenophobia of natives after 9/11.

The distributed nature of the Internet allowed surveyors to close the geographic gap,
while the practically zero marginal costs of email- or web-based surveys made repeating
surveys possible not only in an affordable manner but also at an unprecedented
cross-sectional size, frequency and scale. Naturally, the Internet as a survey platform brought
with it both new potential and new methodological challenges. However, as the Internet
becomes ubiquitous there are, a priori and in principle, no artificial limits to creating
truly random and representative data samples. By connecting an ever-larger part
of the population we progressively eliminate selection bias, because the online
population tends to become equal to the general population, thus allowing us to have
truly random and representative samples, at least when there is full access to the data.

At the same time, progress in ICT makes sampling less and less of an unavoidable fact,
since we are able to deal with practically unlimited amounts of data.

A prime example of the Internet as a means for carrying out large-scale surveys is the
Wage Indicator Survey of the Wage Indicator Foundation,3 which provides wage data for
over 60 countries in over 20 languages, producing harmonized and hence comparable
data on wages in a large cross section. It also suffers from the aforementioned issues of
selection bias, which need more research before they can be handled properly. Other data
sources used from the Internet include data from Google, Wikipedia, Facebook, Twitter,
Google+ and LinkedIn, to mention a few.


2  The  story  of  this  early  development  is  nicely  told  in  (Das  et  al.,  2011). 

3 For more information about the Wage Indicator Foundation, see
http://www.wageindicator.org/. The data bank center of IZA, IDSC, hosts the Wage
Indicator Survey and makes it publicly accessible.
As the conducting of surveys over the Internet matured and turned into an indispensable
instrument, the proliferation of the Internet brought about even more data that do not
come from voluntary surveys. This type of data arises because more of what used to be
"offline life" is being transferred online. One of the things that ICT and the Internet can
do well is to reduce frictions in any kind of matching task in almost any type of market;
hence many businesses that deal with searching and matching in various contexts and
markets now operate online. Matching is fundamental to life in general and to
economics in particular, and a large portion of economics research is dedicated to
understanding matching problems and finding optimal solutions.

Whether it is about matching city passengers to taxis, long distance travellers to airplane
seats,4 individuals in the marriage market (Hitsch et al., 2010) or in the job market
(Kuhn, 2014), the Internet has reduced search frictions and hence opened new business
opportunities, where businesses like online dating services or job board services flourish.
Naturally, this creates a new data source for understanding matching in the various contexts
of economic behavior and for investigating old questions.

In reality, the Internet has replaced many labor markets. If people are looking for a
carpenter, a handyman, a medical doctor or a lawyer, they go online and type these
keywords. "Free" services such as "zocdoc," "askjenny" or "craigslist" immediately
provide people with hundreds of options. Likewise, employers often use the Internet to
screen and hire employees (maybe via the career network LinkedIn), while cutting
down on hiring costs by using Skype. The power of the Internet revolution was
particularly evident during the global financial and economic crisis of 2008-2011.
Internet data show us that the unemployed searched the web extensively to find a job
locally or globally.

The core search-and-match market on the Internet is, of course, the market of the Internet
search engines themselves, which match the demand for information with the supply of
documents that contain this information. Google Trends data are an interesting, if
aggregate, form of data exactly because of that. Knowing how demand for certain types of
information varies over time reveals information about the state of the agents seeking this
information. This fact is at the core of Google's business model, and Google Trends
provides us with an aggregate impression of this demand. It is also at the core of our own
work with Google Trends data (Askitas and Zimmermann, 2009, 2011 and 2015) and
technology data in general (Askitas and Zimmermann, 2013).

Internet data are born digital, so they are easy to store, manage and curate. They are
time-stamped and geo-tagged, allowing in principle precise measurements in the longitudinal
and cross-sectional dimensions at any preferred scale. While geo-location is today still
largely based on the geo-location of IP addresses (precise at the country level, but
increasingly imprecise as we drill down to smaller geographic units), it will eventually
become precise as we substitute IP-based geo-location with true GPS (Global Positioning
System) based geo-location.


4 For example: a search for "taxi" in the Travel Category of the Apple App Store returns
over 1200 apps, while a search for "cheap flights" returns 290 apps.


New data will come to exist out of the so-called Internet of Things. An increasing array of
affordable embedded sensors will transmit precise, geo-located measurements in a timely
manner and will cover everything from individual vital signs to the emotional state and
well-being of individuals, as well as any economic or other human activity. Obviously these
developments will make the economy more data dependent and will hence open new
research opportunities but also bring along new challenges. On the one hand, new
technologies and their recombination produce new data with new potential. On the
other hand, we continue to inherit some of the old issues as we go along.

Selection bias continues to be an issue because adoption differs, and will continue to do
so with every technological wave, both across countries and across individuals. Internet
penetration in some countries is now over 90%, so that for those countries coverage is
eliminated as a potential cause of selection bias, but others are still lagging behind. In
2013, the penetration rate was 95.0% in Norway, 86.2% in Germany, 84.2% in the US, and
74.8% in Spain.5 However, even in countries in which everybody is online, not everybody
uses a smart phone and not everybody uses social media, so that any research with new
data whose origin includes smart phone technology or social media may still suffer from
selection bias. To be clear: selection bias is not automatically ruled out by coverage, for
instance when there is only access to the data of users who made a particular choice.
Conversely, low coverage does not by itself rule out representativeness (this is the basis
of traditional surveys), so one has to examine how representative Internet data are.
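
One common way to examine and partly correct for non-representativeness is to reweight
an online sample so that its composition matches known population shares
(post-stratification). The following is a minimal sketch under stated assumptions: the
strata, the sample values and the population shares are hypothetical, and the function name
poststratified_mean is ours. Reweighting of this kind only corrects for imbalance on
observed characteristics; it does not remove selection on the choice to use a service.

```python
def poststratified_mean(sample, population_shares):
    """Reweight stratum means by known population shares.

    sample: list of (stratum, value) pairs from the online data.
    population_shares: dict mapping stratum -> share in the target population.
    """
    totals, counts = {}, {}
    for stratum, value in sample:
        totals[stratum] = totals.get(stratum, 0.0) + value
        counts[stratum] = counts.get(stratum, 0) + 1
    missing = set(population_shares) - set(counts)
    if missing:
        # A stratum with no online observations cannot be reweighted at all.
        raise ValueError(f"no observations for strata: {sorted(missing)}")
    return sum(share * totals[s] / counts[s]
               for s, share in population_shares.items())


# Hypothetical online sample in which young users are over-represented.
sample = [("young", 1.0)] * 8 + [("old", 0.0)] * 2
population_shares = {"young": 0.4, "old": 0.6}  # assumed census shares

naive = sum(value for _, value in sample) / len(sample)
adjusted = poststratified_mean(sample, population_shares)
print(f"naive mean: {naive:.2f}, post-stratified mean: {adjusted:.2f}")
```

In this toy example the unweighted mean reflects the over-representation of young users,
while the reweighted mean uses the assumed population shares instead.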

A natural concern of economists is how to adequately record and measure all the
transactions that take place via the Internet. Some have raised the issue that while the
Internet has dramatically increased the welfare of consumers, who have received
information and entertainment for free since 2000, this is not well reflected in the GDP
numbers. Facebook, for example, produces advertising-supported media, as do Google,
YouTube, etc. While this is not a problem yet,6 it can be a challenge in the future as new
technology, gadgets and services emerge through the Internet.

Two of the biggest challenges we face, as technology makes all this possible, are data
privacy issues and data custody and ownership issues. As a society we will need to
remain vigilant both in protecting privacy and in reforming the normative, ethical
and legal framework within which to discuss it. We need to strengthen the institutional
structures that keep markets contestable in order to avoid a misuse of the monopoly of
data by a handful of firms; this is vital for society. It is also problematic that the data are
unavailable to the common good because they are locked up in proprietary data silos.
Finally, there is the question of the government's use of data about its citizens and the
limits of this use.7


5  Source:  Internet  World  Stats,  http://www.internetworldstats.com/top25.htm. 

6  In  the  US,  the  current  System  of  National  Accounts  2008  that  measures  GDP  includes 
the  value  added  of  producing  advertising  supported  media.  However,  the  output  is 
treated  as  an  intermediate  input  to  other  industries  rather  than  as  a  final  household 
consumption  (Soloveichik,  2015). 

7 Going after tax evaders, the Greek tax authorities in 2010 used Google Maps to locate
houses with swimming pools that should have been taxed. They managed to increase
the number of taxable swimming pools in the Athens suburbs from the originally
believed 324 to 16,974 swimming pools. This is a prime example both of the invasion of
individuals' data privacy and of benefits for the collective.
(http://www.spiegel.de/international/europe/finding-swimming-pools-with-google-earth-greek-government-hauls-in-billions-in-back-taxes-a-709703.html)


3. A first guide to relevant literature


Internet data can successfully be applied to a wide range of human resource issues
including forecasting (e.g. of unemployment, consumption goods, tourism, festival
winners and the like), nowcasting (obtaining relevant information much earlier than
through traditional data-collection techniques), detecting health issues and well-being
(e.g. flu, malaise and ill-being during economic crises), documenting the matching
process in various parts of individual life (e.g. jobs, partnership, shopping,
preferences), and measuring complex processes where traditional data have known
deficits (e.g. international migration, collective bargaining agreements in developing
countries).

In what follows, we provide information on some literature that has used Internet data in
the context of human resources within the social sciences. The early contributions have
applied Google activity data; among them, we find Constant and Zimmermann (2008),
Ginsberg et al. (2009), Askitas and Zimmermann (2009) and Choi and Varian (2009). An
exception is the paper by Ettredge et al. (2005) that used data from the WordTracker's
Top 500 Keyword Report.

3.1 Analyzing and predicting unemployment

Predicting the present was a particular challenge during the Great Recession,
when short-term information on unemployment was badly needed but unavailable. The
seminal paper by Askitas and Zimmermann (2009) was the first to demonstrate strong
correlations between particular Google keyword searches and monthly German
unemployment rates. The authors then used the observed structure to predict
unemployment behavior under the complex and changing circumstances of the onset
of the crisis.

This type of exercise has been replicated and extended for various other countries. The
research demonstrates that Google or similar Internet activity data carry additional
information content over alternative time-series models or other business cycle
indicators. Such studies have been conducted for the US (Choi and Varian,
2009 and 2012, claims for unemployment benefits), France (Fondeur and Karame, 2013,
unemployment rates), Italy (D'Amuri, 2009, unemployment rate), Spain (Vicente et al.,
2015, unemployment levels), the UK (McLaren and Shanbhogue, 2011, unemployment
rates), Ukraine (Oleksandr, 2010, unemployment levels, but no acceptable
performance), Israel (Suhoy, 2009, unemployment rates), Norway (Anvik and Gjelstad,
2010, unemployment rates) and China (Su, 2014, unemployment internet search
indicators from Baidu and Google, significant correlation with Purchasing Managers'
Employment Indices), among others.

Before Google activity data became available, Ettredge et al. (2005) were able to utilize
Internet search engine keyword usage data recorded in the WordTracker's Top 500
Keyword Report, published weekly by Rivergold Associates Ltd and covering the Web's
largest meta-search engines. Taking the report to provide an unbiased view of searches,
they exploited six terms they thought would be most used by job seekers to predict US
unemployment rates, namely job search, jobs, monster.com, resume, employment, and job
listings. The authors concluded that their findings showed an encouraging correlation. Note
that when we did our analysis (Askitas and Zimmermann, 2009), we were not aware of
the Ettredge et al. (2005) study.
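
The claim that search data carry additional information over time-series models can be
illustrated with a small out-of-sample comparison. The sketch below is not the
specification used in any of the cited studies; it assumes a simple AR(1) baseline, a
single hypothetical search index observed in the same month as the target, and purely
synthetic data generated so that the index is informative by construction. The function
names ols_fit and out_of_sample_rmse are ours.

```python
import numpy as np


def ols_fit(X, y):
    """Least-squares coefficients for y ~ X (X already contains a constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def out_of_sample_rmse(y, extra=None, n_test=24):
    """One-step-ahead RMSE of an AR(1) model over the last n_test periods.

    If extra is given, its same-period value is added as a regressor,
    mimicking a search index that is available before the official figure.
    The model is re-estimated on all earlier data before each prediction.
    """
    errors = []
    for t in range(len(y) - n_test, len(y)):
        # Training rows s = 1..t-1: target y[s], regressors 1, y[s-1], extra[s].
        cols = [np.ones(t - 1), y[:t - 1]]
        if extra is not None:
            cols.append(extra[1:t])
        beta = ols_fit(np.column_stack(cols), y[1:t])
        x_new = [1.0, y[t - 1]] + ([extra[t]] if extra is not None else [])
        errors.append(y[t] - np.dot(x_new, beta))
    return float(np.sqrt(np.mean(np.square(errors))))


# Synthetic monthly data (assumption: the index drives the target series).
rng = np.random.default_rng(0)
n = 120
search = rng.normal(size=n)   # stand-in for a job-search keyword index
y = np.zeros(n)               # stand-in for a demeaned unemployment rate
for t in range(1, n):
    y[t] = 0.8 * y[t - 1] + 0.5 * search[t] + 0.1 * rng.normal()

rmse_ar = out_of_sample_rmse(y)
rmse_aug = out_of_sample_rmse(y, extra=search)
print(f"AR(1) RMSE: {rmse_ar:.3f}, AR(1) + search index RMSE: {rmse_aug:.3f}")
```

With real data the comparison is the same, but whether the augmented model wins is an
empirical question, as the Ukrainian case cited above shows.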

3.2  Other  forecasting,  nowcasting  and  proxying 

A variety of research has studied the predictive properties of Internet data, in particular
Google activity data, for current and future macroeconomic variables beyond
unemployment and the labor market. Investigating business cycles is one
well-researched area: Askitas and Zimmermann (2013) demonstrate that the German
business cycle can be nowcast using highway toll data. Chen et al. (2015) find that
Google search volume data help improve the timeliness of business cycle turning point
identification during the 2007-2008 US recession.

Another example is aggregate consumer behavior: Choi and Varian (2012) use Google
activity data for the US to study automobile sales, travel destination planning and consumer
confidence. Carriere-Swallow and Labbe (2013) show that Google search queries related to
automobile purchases in Chile improve the fit and efficiency of nowcasting models for
automobile sales and are better at identifying turning points, even though Internet use was
still low in Chile. In a study of the US housing market for 2006-2011, Askitas and
Zimmermann (2011) evaluate search intensity data for "hardship letter" from Google
Insights to detect ensuing mortgage delinquencies. Other studies cover food stamps
data in the US (Fantazzini, 2014), private consumption (Kholodilin et al., 2010, for
Germany; Vosen and Schmidt, 2011, for the US) and hotel demand from web traffic data
(Yang et al., 2014). Vosen and Schmidt (2011) show that in almost all of their forecasting
experiments a Google search activity indicator outperforms well-known survey-based
indicators.

Saiz and Simonsohn (2013) suggest systematically using Internet data to proxy
unobservable variables and demonstrate the usefulness of this technique for a selection
of occurrence frequencies of crucial social phenomena in the US.

3.3  Health  issues 

The early innovative study by Ginsberg et al. (2009) used Google activity data to study
the influenza epidemic process, applying complex computational methods. While this
study has received much attention, it was also found that the structure identified
between the searches, the defined keywords and the individual behavior changed over
time (Lazer et al., 2014). This leads to the understanding that models have to be
adjusted over time.
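
One simple way to see why models have to be adjusted over time is to re-estimate the
relationship on a rolling window and watch the coefficient move. The sketch below is an
illustration only, not the method of Ginsberg et al. (2009) or Lazer et al. (2014). It
assumes a single hypothetical search-intensity index and synthetic data in which the
true slope shifts halfway through the sample; the function name rolling_slopes is ours.

```python
import numpy as np


def rolling_slopes(x, y, window):
    """OLS slope of y on x (with intercept) within each trailing window."""
    slopes = []
    for end in range(window, len(x) + 1):
        xs, ys = x[end - window:end], y[end - window:end]
        # np.polyfit returns coefficients from highest degree down.
        slope, _intercept = np.polyfit(xs, ys, deg=1)
        slopes.append(slope)
    return np.array(slopes)


# Synthetic data: the link between searches and the outcome weakens over time.
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)                               # hypothetical search index
true_slope = np.where(np.arange(n) < 100, 2.0, 0.5)  # structural shift at t = 100
y = true_slope * x + 0.1 * rng.normal(size=n)

slopes = rolling_slopes(x, y, window=50)
print(f"first window slope: {slopes[0]:.2f}, last window slope: {slopes[-1]:.2f}")
```

A model fitted once on the early part of such a sample would systematically overstate the
outcome later on, which is why periodic re-estimation is advisable.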

Another issue of recent concern is the relationship between seasonality, business cycles
and mental health. Yang et al. (2010), using US Google data, present for the first time on
a global level evidence that the incidence of depression varies seasonally. Tefft (2011)
studies the relationship between unemployment indicators and Google searches for
depression and anxiety in the US around the time of the Great Recession. His empirical
evidence implies that unemployment and continued unemployment insurance claims
are positively correlated with searches for depression, while initial unemployment
insurance claims are negatively related to searches for depression and anxiety.

3.4  Matching  on  the  labor  market 

As the Internet becomes more and more the place where demand and supply meet, it
may soon overshadow the markets' role as a source of information. This matching role is
even more important for the labor markets; experts expect the Internet to become the
dominant platform of exchange (Kuhn, 2014). Online job boards, of rising importance in
the real world, are being heavily used by researchers (see Askitas and Zimmermann,
2009), and they improve matching in the job market (Kuhn and Mansour, 2014). Internet
job boards not only reduce search frictions, but are also useful in documenting and
studying artificial frictions such as discrimination (Kuhn and Shen, 2012; Maurer-Fazio,
2012). Kurekova et al. (2014) provide an insightful study of the methodological
challenges that result from the new online job vacancy data and voluntary web-based
surveys provided by those platforms.

3.5  Demographic  issues 

Our understanding of migratory processes is still limited, one of the reasons being that
the available data are insufficient, since international migrants are not well covered in
traditional national surveys. A number of researchers in the social sciences have explored
the use of Internet data to close gaps. An example is Reips and Buffardi (2012), who
examine social networking Web pages to study migrant biculturalism. Billari et al.
(2013) used Google search queries such as birth and pregnancy to predict fertility measures
with some success.

Hitsch et al. (2010) and Bellou (2015) are prominent research examples on the rising
role of the Internet in marriage markets. Employing data on user attributes and
interactions from an online dating site, Hitsch et al. (2010) estimate mate preferences
and are able to predict stable matches. Bellou (2015) shows that in the US, marriage
rates in age groups that are more likely to act online have increased due to the diffusion
of the Internet, which has even substituted for other matching methods (such as via family
and friends). In fact, one of the driving factors of reduced divorce rates is better (online)
matching.

3.6  Political  processes 

Internet data can be considered as a universe of answers to questions that were not yet
posed. In an early contribution to the literature, Constant and Zimmermann (2008) used
Google search engine query data to measure economic and political activities by
documenting the evolution of particular keyword searches (financial crisis, credit
crunch, recession, unemployment rate) and offered a good impression of which topics
influenced the US presidential elections. Stephens-Davidowitz (2014) employed Google
activity data for the US involving racially charged language. Contrary to previous
attempts with standard survey data, he found solid evidence with these data that racial
animus had cost Barack Obama votes as Presidential candidate in 2008. Reilly et al.
(2012) also used Google data for state politics research in the context of the 2008
Presidential election, successfully relating the searches to actual engagement
with the ballot measures.

Ripberger (2011) compared, for the US, public attention to various political issues
(health care, global warming, and terrorism) as measured by Google activity data with
issue coverage in the New York Times and found high correlation and validity.

4.  This  special  issue 

In  this  issue  of  the  International  Journal  of  Manpower  we  put  together  a  set  of  six 
carefully  selected  articles  contributing  to  the  rising  use  of  Internet  data  from  different 
sources  and  for  a  variety  of  purposes. 

The  paper  by  Emilio  Zagheni  and  Ingmar  Weber  on  "Demographic  Research  with  Non- 
Representative  Internet  Data"  (Zagheni  and  Weber,  2015)  addresses  the  two  most 
critical  methodological  issues  in  the  use  of  internet  data:  non-representativeness  and 
selection  bias.  It  proposes  a  framework  to  collect  web  data  and  discusses  possible 
estimation  methods,  while  it  also  makes  clear  that  there  are  still  some  large  challenges 
to  address.  The  paper  also  surveys  relevant  demographic  literature,  in  particular  in  the 
area  of  migration,  where  useful  data  about  the  mobility  process  are  typically  scarce  in 
the  traditional  data  sources. 

Two papers study well-being from different data sources. Nikolaos Askitas and Klaus F.
Zimmermann examine "Health and Well-Being in the Great Recession" (Askitas
and Zimmermann, 2015), using Google activity data to trace and document the impact of
the 2008 Financial and Economic Crisis on well-being. They are able to confirm previous
knowledge from the economics of health, well-being and the business cycle. Martin Guzi
and Pablo de Pedraza, in their article "A Web Survey Analysis of Subjective Well-being"
(Guzi and de Pedraza, 2015), employ data from the voluntary web survey of the
WageIndicator project. They confirm that job characteristics affect job satisfaction and
identify spillovers, since satisfaction in one domain affects other domains.

Margaret Maurer-Fazio and Lei Lei study the Chinese Internet job board labor market in
their paper "'As Rare as a Panda': How Facial Attractiveness, Gender, and Occupation
Affect Interview Callbacks at Chinese Firms" (Maurer-Fazio and Lei, 2015). In a resume
audit (correspondence) study, they examine how discrimination based on gender and
facial attractiveness varies across occupation, location, and firms' ownership
type and size. They find that women are generally preferred to men and that unattractive
job candidates are at a disadvantage.

In their paper "Comparing Collective Bargaining Agreements for Developing Countries",
Janna Besamusca and Kea Tijdens employ for the first time the web-based
WageIndicator Collective Bargaining Agreement Database for 11 developing countries
(Besamusca and Tijdens, 2015). They find that few agreements specify wage levels, but
almost all collective agreements have clauses on wages. Their study also documents
working hours, paid-leave arrangements and work-family arrangements.

The final paper by Concha Artola, Fernando Pinto and Pablo de Pedraza, entitled "Can
Internet Searches Forecast Tourism Inflows?", represents the large literature on using
Internet data for forecasting purposes (Artola, Pinto and de Pedraza, 2015). Employing
Google activity data, the authors demonstrate that traditional time-series forecasting
models for tourism inflows into Spain can be improved using Google activity measures.

These contributions are an earnest attempt to evaluate the potential of the new data
sources, and an encouragement for their further use. Research has to go beyond the
current borders, both in terms of methodological developments and in terms of practical
use of the data in concrete applications in all areas of the social sciences.


5.  Conclusion 

Internet data will record a large part of our life and will become, at least to some extent,
part of the research in the social sciences. Already today, there is a rising and successful
use of these data in proxying, nowcasting or forecasting reality, explaining behavior and
modeling market processes. There seems to be a large potential to push the research
frontier ahead. Remaining problems are the coverage of the data and the limited
techniques we have available to deal with the application challenges. Ahead of us is the
collection of "Big Data"; therefore, we may not need to be too afraid of selection bias in
the future. However, this expectation is based on free and complete access to the data.
Privacy issues may prevent a fast solution. Nevertheless, we think that with the rise of the
Internet of Things we will also see, not too far ahead, a further revolution in applied
research in the social sciences.